Lab 10: Statistics

Analysis of RCT (Randomized Control Trial) Data

Author

Bogdan G. Popescu

1 Introduction

In this hypothetical situation (the data is made up), the World Bank is planning on launching a program designed to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets. So they have funding to run a randomized controlled trial (RCT) in three districts in Malawi that are very close to the Malawi lake. The World Bank randomly selected 1000 individuals from these districts and they want to investigate whether receiving such nets has any effect on people’s risk of malaria.

In this document, we want to calculate \(E(\text{Post-Malaria Risk | Net})\)

library("broom")
library("dplyr")
library("ggplot2")
library("ggrepel")
library("modelsummary")
library("ggdag")
library("purrr")
library("stringr")
library("ggpubr")
library("ggpattern")


# For Maps
library("sf")
library("stars")
library("sp")
library("rnaturalearth")
library("rnaturalearthdata")
library("ggspatial")
library("dichromat")

# For interactive maps
library("leaflet")
library("leaflegend")
library("htmltools")

# Removing previous files and setting the
# directory
rm(list = ls())
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/lab10/")

In a first instance, we want to draw a causal model which explains the effect of the insecticide-treated bed nets on participant risk of malaria, given the confounding caused by age, sex, and income. We think that the risk of malaria will be impacted by age, sex, income, prior risk of malaria, and whether people used the insecticide-treated net. See the models below (together with the code to produce them).

Step 1 Code
#Basic structure of a dag
malaria_dag <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
                     net ~ income + pre_malaria_risk,
                     exposure = "net",
                     outcome = "post_malaria_risk",
                     labels = c(post_malaria_risk = "Post Malaria Risk",
                                net = "Mosquito Net",
                                age = "Age",
                                sex = "Sex",
                                income = "Income",
                                pre_malaria_risk = "Pre Malaria Risk"),
                     coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
                                         sex = 4, pre_malaria_risk=6),
                                   y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
                                         sex = 4, pre_malaria_risk=4)))

#Cleaning the dags and turning it into a dataframe.
bigger_dag <-data.frame(tidy_dagitty(malaria_dag))
#Labeling variables
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
#Identifying min and max coordinate
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)
Ctep 2 Code
#Defining colors
col = c("Outcome"="green3",
        "Intervention"="red",
        "Confounder"="grey60")

#Defining order
order_col<-c("Outcome", "Intervention", "Confounder")

#Producing the graph
example_dag<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
  geom_dag_point() +
  geom_dag_edges() +
  coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), ylim = c(min_lat_y-1, max_lat_y+1), expand = FALSE)+
  scale_colour_manual(values = col,
                      name = "Group",
                      breaks = order_col)+
  geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)), 
                   aes(label = label), fill = alpha(c("white"),0.8))+

  theme_bw()+
  labs(x = "", y="")

#Saving the ggplot
ggsave(example_dag, file = "./graphs/example_dag1.jpg", 
       height = 20, width = 25, 
       units = "cm", dpi = 200)

example_dag

In the experiment, we want to manipulate access to the program (providing nets). Thus, we can calculate \(E(\text{malaria risk | net})\). Following the rules of causal diagrams, we can delete all the arrows going into the program node, because the nets are randomly assigned.

Step 3 Code
#Basic structure of a dag
malaria_dag2 <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
                     exposure = "net",
                     outcome = "post_malaria_risk",
                     labels = c(post_malaria_risk = "Post Malaria Risk",
                                net = "Mosquito Net",
                                age = "Age",
                                sex = "Sex",
                                income = "Income",
                                pre_malaria_risk = "Pre Malaria Risk"),
                     coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
                                         sex = 4, pre_malaria_risk=6),
                                   y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
                                         sex = 4, pre_malaria_risk=4)))
Step 4 Code
bigger_dag <-data.frame(tidy_dagitty(malaria_dag2))
bigger_dag$type<-NA
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)
Step 5 Code
#Defining colors
col = c("Outcome"="green3",
        "Intervention"="red",
        "Confounder"="grey60")

#Defining order
order_col<-c("Outcome", "Intervention", "Confounder")


#Producing the graph
example_dag2<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
  geom_dag_point() +
  geom_dag_edges() +
# geom_text(data = bigger_dag, aes(x = x, y = y, label = label), 
#            hjust=0.4, vjust=0, color = "black")+
  coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), ylim = c(min_lat_y-1, max_lat_y+1), expand = FALSE)+
  scale_colour_manual(values = col,
                      name = "Group",
                      breaks = order_col)+
  geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)), 
                   aes(label = label), fill = alpha(c("white"),0.8))+
  theme_bw()+
  labs(x = "", y="")

#Saving the file
ggsave(example_dag2, file = "./graphs/example_dag2.jpg", 
       height = 20, width = 25, 
       units = "cm", dpi = 200)
#Displaying the file
example_dag2

2 Loading the Data

Create a folder called lab10. Within it, create another folder and call it data. Go to the following links and download the following:

Place all the data in the data foler.

3 Inspecting the Area

Let’s assume that the experiment has been conducted and the data has come in. Let us first have a first look at the area where the World Bank ran this RCT.

3.1 Loading the Shape Files

st_read reads the shapefiles from your hardrive. read_stars reads a tif file that shows the elevation for your area of interest. ne_countries loads a shapefile with all the countries around the world.

#Country shapefile
mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
#Region 1 shapefile
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
#Region 3 shapefile
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
#Selected communes shapefile
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
#Raster with elevation
elev <- read_stars("./data/elev.tif")
#World Shapefile
world <- ne_countries(scale = "medium", returnclass = "sf")
#Switched off spherical geometry (s2)
sf::sf_use_s2(FALSE)

3.2 Simplifying the shapefiles for better rendering

st_simplify simplifies the polygons for quicker rendering.

#Simplifying polygons for quicker rendering
#Note: you have to play around with dTolerance to figure out the appropriate value
#Too little - your computer will need more time to render the file
#Too much - the country polygons will be too simple
mwi<-st_simplify(mwi,  dTolerance = 0.005)
mwi1<-st_simplify(mwi1,  dTolerance = 0.005)
mwi3<-st_simplify(mwi3,  dTolerance = 0.005)
select_comm<-st_simplify(select_comm,  dTolerance = 0.0005)
world<-st_simplify(world,  dTolerance = 0.005)

3.3 Calculating the appropriate zoom level for Malawi

We now calculate the zoom level for the entire country of Malawi.

#Calculating the appropriate zoom level
mwi_extent <- st_bbox(mwi)

min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]

min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]

We now create a new polygon to emphasize where Malawi is on the World Map.

#Creating a dataframe to emphasize Malawi and adding some error
error = 5

sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error, 
                             min_lon_x - error, max_lon_x + error),
                   "lat" = c(min_lat_y - error, min_lat_y - error, 
                             max_lat_y + error, max_lat_y + error))

We now make a spatial polygon out of it so that we can highlight where Malawi is on the world map.

#Creating polygon to indicate zoom
poly <- sp_df %>% 
  st_as_sf(coords = c("lon", "lat"), 
           crs = st_crs(mwi)) %>% 
  st_bbox() %>% 
  st_as_sfc()

#Examining what we just created
ggplot() +
  geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.9)+
  geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.3)

3.4 Cropping the raster

We now crop our raster only to the area of interest. Again, if we have too much, our computer will have a hard time rendering the file.

#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(elev, st_as_stars(st_bbox(elev), dx = 0.05, dy = 0.05))
#Step2: Removing really high pixel values for good contrast in MWI
elev2[elev2 > 2000] <- NA
#Step3: Cropping
dem_mwi<-st_crop(elev2, poly)
map_word<-ggplot() +
  geom_sf(data = world, fill=NA, color = "black", linewidth = 0.3)+
  geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.7)+
  geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.9)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("Malawi on the World Map")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))
#ggspatial::annotation_scale(location = 'tr')
map_word

ggsave(map_word, file="./graphs/map_world.jpg", height=15, width=30, units = "cm", dpi=300)

3.5 Mapping the region of interest - Malawi

#Selecting only the region of interest
mwi1_lab<-subset(mwi1, NAME_1=="Nkhata Bay")
#Defining colors for raster
grey_palette <- colorRampPalette(c("grey90", "grey30"))
# Generate a sequence of colors based on the number of breaks desired
num_breaks <- 10  # Adjust this based on your data
grey_colors <- grey_palette(num_breaks)

This is the region of Nkhata Bay where the World Bank ran the RCT. Within the region of Nkhata Bay, the World Bank chose three settlements close to the lake where people could be exposed to malaria.

We now want to draw a new map to emphasize where Nkhata Bay is.

error<-0.2
map_malawi<-ggplot() +
  geom_stars(data=elev2, show.legend=FALSE) +
  scale_fill_gradientn(colours = grey_colors)+
  geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
  geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
  geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
  geom_sf_text(data=mwi1_lab, aes(label = NAME_1), size=3.5)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("Nkhata Bay in Malawi")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))+
    coord_sf(xlim = c(min_lon_x-5*error, max_lon_x+5*error), ylim = c(min_lat_y-error, max_lat_y+error))+
    ggspatial::annotation_scale(location = 'tr')
#Saving the map
ggsave(map_malawi, file="./graphs/map_malawi.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi

3.6 Calculating the appropriate zoom level for the Selected Communes

We now calculate the zoom level for the region of interest - Nkhata Bay.

#Calculating the appropriate zoom level for the selected commune
mwi_extent <- st_bbox(select_comm)

min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]

min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]
#Re-reading the shapefiles.
#We now want more accurate polygons since we are zooming in.
#This is why we load them again

mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
dem <- read_stars("./data/elev.tif")

#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(dem, st_as_stars(st_bbox(elev), dx = 0.02, dy = 0.02))
#Step2: Removing really high pixel values for good contrast in MWI
elev2[elev2 > 2000] <- NA

#Step3: Creating a dataframe to emphasize Malawi and adding some error
error = 0.5
sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error, 
                             min_lon_x - error, max_lon_x + error),
                   "lat" = c(min_lat_y - error, min_lat_y - error, 
                             max_lat_y + error, max_lat_y + error))

#Step4: Creating polygon to indicate zoom
poly <- sp_df %>% 
  st_as_sf(coords = c("lon", "lat"), 
           crs = st_crs(mwi)) %>% 
  st_bbox() %>% 
  st_as_sfc()

#Step5: Cropping
dem_mwi<-st_crop(elev2, poly)

world <- ne_countries(scale = "medium", returnclass = "sf")
#Turning off spherical geometries
sf::sf_use_s2(FALSE)
#We also want to include roads in our map, to provide more context.
#Reading the OSM roads data
roads<-st_read("./data/malawi-latest-free/gis_osm_roads_free_1_crop2.shp")
#Step1: Dissolving polylines
roads2 <- st_combine(roads)
#Step2: Turning roads2 into an sf object
roads3<-st_as_sf(roads2)
#Step3: Only the roads of interest
roads_crop1 <- sf::st_intersection(roads3, select_comm)

3.7 Mapping the Individuals

Let us first look at our data.

rct_data <- read.csv(file = './data/rct_data_geo_malawi.csv')
glimpse(rct_data, n=10)
Rows: 1,000
Columns: 10
$ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ lat               <dbl> -11.27476, -11.15300, -11.13967, -11.25544, -11.1777…
$ lon               <dbl> 34.03006, 34.14594, 34.14297, 34.14619, 34.20031, 34…
$ income            <dbl> 409.4701, 520.8072, 581.3331, 324.0727, 532.1844, 53…
$ age               <int> 10, 35, 7, 43, 45, 51, 24, 17, 38, 69, 10, 21, 32, 3…
$ sex               <chr> "Man", "Woman", "Woman", "Woman", "Man", "Woman", "M…
$ health            <dbl> 82.78928, 81.18602, 80.57399, 82.84023, 87.40737, 57…
$ net               <int> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1…
$ malaria_risk_pre  <dbl> 35.74454, 36.65260, 22.87415, 35.41934, 27.49873, 42…
$ malaria_risk_post <dbl> 42.52199, 34.27589, 32.67552, 43.87256, 27.16945, 41…
error<-0.05
map_malawi_exp<-ggplot() +
  #Plotting the elevation raster
  geom_stars(data=elev2, show.legend=FALSE) +
  #Making elevation grey
  scale_fill_gradientn(colours = grey_colors)+
  #Plotting the selected commune
  geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
  geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
  geom_sf(data = mwi3, fill=NA, color = "red", linewidth = 0.4)+
  geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
  #Plotting the roads
  geom_sf(data = roads_crop1, fill=NA, color = "black", linewidth = 0.3)+
  #Plotting the RCT data
  geom_point(data = rct_data, aes(x = lon, y= lat), color = "red", size=0.3)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))+
        coord_sf(xlim = c(min_lon_x-3*error, max_lon_x+3*error), 
                 ylim = c(min_lat_y-error, max_lat_y+error))+
  #Adding a scale to the map
  ggspatial::annotation_scale(location = 'tr')

#Saving the picture in folder one your machine called "graphs"
ggsave(map_malawi_exp, file="./graphs/map_malawi_exp.jpg", height=15, width=15, units = "cm", dpi=300)

map_malawi_exp

3.8 Mapping the Individuals in an Interactive Way

Here is how we can make an interactive map.

rct_data$treat<-NA
rct_data$treat[rct_data$net==1]<-"Treatment"
rct_data$treat[rct_data$net==0]<-"Control"


#name <- colorFactor(topo.colors(3), domain = carrefour$name) 
pal <- 
   colorFactor(palette = c("#8ED5D8", "#F2B6B2"), 
               levels = c("Control", "Treatment"))


leaflet(rct_data) %>% addTiles() %>%
addPolygons(data = select_comm,
                group = "Nkhata Bay",
    # stroke parameters
    weight = 3, color = "blue",
    # fill parameters
    fillColor = "blue", fillOpacity = 0.0)%>%
    
  addCircleMarkers(lng = ~lon, lat = ~lat, 
             label = ~htmlEscape(paste("Indiv.: ", id)), radius = 4, color = ~pal(treat))%>%
  addLegendFactor(
    pal=pal,
    values = rct_data$treat,
    position = 'topright',
    shape = 'circle',
    width = 10,
    height = 10,
    opacity = 0.5)%>%
  addScaleBar(
  position = c("bottomright"))

4 Data Analysis

We will now do the data analysis. Until now, we looked at various ways of mapping our data.

4.1 Summary statisics

Before we conduct any analysis, it is important to get a sense of our data, the minimum and maximum, the mean, and whether our variables are normally distributed or not. Let us now create a summary statistics table:

datasummary_skim(rct_data)
Unique (#) Missing (%) Mean SD Min Median Max
id 1000 0 500.5 288.8 1.0 500.5 1000.0
lat 1000 0 −11.2 0.1 −11.3 −11.2 −11.1
lon 1000 0 34.1 0.1 34.0 34.1 34.2
income 1000 0 498.0 74.8 245.3 497.0 739.7
age 76 0 28.9 16.1 1.0 27.0 89.0
health 1000 0 69.1 12.9 30.9 69.0 100.0
net 2 0 0.5 0.5 0.0 1.0 1.0
malaria_risk_pre 1000 0 38.0 9.8 5.0 37.9 70.0
malaria_risk_post 1000 0 40.4 10.4 5.0 40.5 70.0

Income and malaria seem to be normally distributed. The net variable is binary, as expected. Age and health seem to be skewed to the left.

4.2 Checking balance

Before calculating the effect of the program, we should first check how well balanced assignment was. As you can tell from the results below, assignment to the program was pretty much split 50/50.

people_randomized<-rct_data %>%
  count(net) %>%
  mutate(prop = n / sum(n))
people_randomized

We also want to check to see how well balanced the treatment and control groups were in participants’ pre-treatment characteristics:

#Creating a 0-1 variable representing women
rct_data$sex_num<-as.numeric(as.factor(as.character(rct_data$sex)))
rct_data$sex_num[rct_data$sex_num==2]<-0

#Calculating the average variably by treated vs. non-treated
people_randomized<-rct_data %>%
  group_by(net) %>%
  summarize(prop_male = mean(sex_num),
    avg_age = mean(age),
    avg_malaria_risk_pre = mean(malaria_risk_pre),
    avg_health = mean(health),
    avg_income = mean(income))

people_randomized

These variables appear fairly well balanced. To check that there aren’t any statistically significant differences between the groups, we should also conduct some t-tests.

We have learned to do t.tests in the following way:

#Method1
rct_data1<-subset(rct_data, net==1)
rct_data0<-subset(rct_data, net==0)
t.test(rct_data1$sex_num, rct_data0$sex_num, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  rct_data1$sex_num and rct_data0$sex_num
t = 0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05564916  0.06597315
sample estimates:
mean of x mean of y 
0.3984674 0.3933054 

Another way to get the same result is the following:

t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.06597315  0.05564916
sample estimates:
mean in group 0 mean in group 1 
      0.3933054       0.3984674 

Thus, we can run t-test for all the following groups:

t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.06597315  0.05564916
sample estimates:
mean in group 0 mean in group 1 
      0.3933054       0.3984674 
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  age by net
t = -1.275, df = 997.37, p-value = 0.2026
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -3.2870789  0.6979223
sample estimates:
mean in group 0 mean in group 1 
       28.25523        29.54981 
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  malaria_risk_pre by net
t = 0.015818, df = 997.7, p-value = 0.9874
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.208182  1.227819
sample estimates:
mean in group 0 mean in group 1 
       37.96975        37.95993 
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  health by net
t = -0.40132, df = 996.97, p-value = 0.6883
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.931451  1.275577
sample estimates:
mean in group 0 mean in group 1 
       68.90312        69.23106 
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  income by net
t = 0.19875, df = 991.26, p-value = 0.8425
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -8.353269 10.236022
sample estimates:
mean in group 0 mean in group 1 
       498.4966        497.5552 

We can also simply ask to be shown the p-values.

t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8677374
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.2026112
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.9873827
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.6882691
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8424983

There seems to be a slight difference when it comes to malaria risk and health, where the people who are treated, have slightly greater risk of malaria. At the same time, they also seem to have greater health. The differences however do not seem to be highly statistically significant and the differences are not that great. But let us also plot these differences.

4.3 Plotting Differences

The following graphs plot the point ranges for the variables of interest.

#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
#Remember that the SE is the SD (var)/Sqrt(Sample size)

plot_diff_sex <- ggplot(rct_data, aes(x = treat, y = sex_num, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Prop. male")

plot_diff_age <- ggplot(rct_data, aes(x = treat, y = age, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Age")

plot_diff_malaria_risk_pre <- ggplot(rct_data, aes(x = treat, y = malaria_risk_pre, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Mal. Risk Bef.")


plot_diff_health <- ggplot(rct_data, aes(x = treat, y = health, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Health")

plot_diff_income <- ggplot(rct_data, aes(x = treat, y = income, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Income")

ggarrange(plot_diff_sex, plot_diff_age,
             plot_diff_malaria_risk_pre,
             plot_diff_health, plot_diff_income, ncol=2, nrow=3, common.legend = TRUE)

Let us now also include the distributions of these variables by group (treatment and control group).

plot_diff_sex_d<-ggplot(rct_data, aes(pattern = sex, x = treat)) +
  geom_bar(data = rct_data,
          position = "dodge",
          aes(fill=treat),
          alpha = 0.5) +
  geom_bar_pattern(position = "dodge",
                   aes(pattern=sex),
                   fill = NA,
                   linewidth=0.5,
                  color="black")+
  scale_pattern_manual(name = NULL,
                       values = c(
                         Man = "stripe",
                         Woman = "circle"))+
  labs(fill = NULL)+
  xlab(NULL)

plot_diff_age_d<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)



plot_diff_malaria_risk_pre_d<-ggplot(rct_data, aes(x = malaria_risk_pre, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)


plot_diff_health_d<-ggplot(rct_data, aes(x = health, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)


plot_diff_income_d<-ggplot(rct_data, aes(x = income, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 10, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 10000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/10000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)

figure1<-ggarrange(plot_diff_sex, plot_diff_sex_d, ncol=2, common.legend = TRUE)
annotate_figure(figure1, top = text_grob("Sex"))

figure2<-ggarrange(plot_diff_age, plot_diff_age_d, ncol=2, common.legend = TRUE)
annotate_figure(figure2, top = text_grob("Age"))

figure3<-ggarrange(plot_diff_malaria_risk_pre, plot_diff_malaria_risk_pre_d, ncol=2, common.legend = TRUE)
annotate_figure(figure3, top = text_grob("Malaria Risk Pre-Intervention"))

figure4<-ggarrange(plot_diff_health, plot_diff_health_d, ncol=2, common.legend = TRUE)
annotate_figure(figure4, top = text_grob("Health Score"))

figure5<-ggarrange(plot_diff_income, plot_diff_income_d, ncol=2, common.legend = TRUE)
annotate_figure(figure5, top = text_grob("Income"))

4.4 Estimating the Difference

So, we are interested in the causal effect of the program - \(E(\text{Post-Malaria Risk | Net})\). We can find this causal effect by calculating the average treatment effect or ATE:

\[ ATE = E(\overline{\text{Post-Malaria Risk}} | \text{Net = 1}) - E(\overline{\text{Post-Malaria Risk}} | \text{Net = 0}) \]

This is simply the average outcome for people in the program minus the average outcome for people not in the program. We can calculate the group averages:

people_randomized<-rct_data %>%
  group_by(net) %>%
  summarize(avg_malaria_risk_post = mean(malaria_risk_post))

That’s 45.6620984 - 35.5793607, or -10.0827377, which means that the program caused an decrease in the risk of malaria of -10.0827377, on average. We can also simply run a regression with post-program malaria risk as the outcome variable and the program indicator variable as the explanatory variable. The results below show two models: (1) in which we simply control for the treatment status and one in which we add a series of controls. We can see that we get the same effect of 10.08274, which we calculated manually. This is very close in magnitude to 10.235, which is the effect, once we control for income, age, and sex.

model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
summary(model_rct1)

Call:
lm(formula = malaria_risk_post ~ net, data = rct_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.5794  -6.4719   0.0798   6.1909  28.9747 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.6621     0.4146  110.14   <2e-16 ***
net         -10.0827     0.5738  -17.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.064 on 998 degrees of freedom
Multiple R-squared:  0.2363,    Adjusted R-squared:  0.2355 
F-statistic: 308.7 on 1 and 998 DF,  p-value: < 2.2e-16
model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
model_rct2 <- lm(malaria_risk_post ~ net + income + age + sex_num, data = rct_data)


models<-list("Risk of Malaria" = model_rct1,
             "Risk of Malaria" = model_rct2)

cm <- c('net'='Received Mosquito Net (Treatment)',
        "income" = "income",
        "age" = "age",
        "sex_num" = "sex",
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
Risk of Malaria Risk of Malaria
Received Mosquito Net (Treatment) −10.083*** −10.235***
(0.574) (0.527)
income −0.045***
(0.004)
age 0.083***
(0.016)
sex 0.518
(0.539)
Intercept 45.662*** 65.608***
(0.415) (1.866)
Num.Obs. 1000 1000
R2 0.236 0.359
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Let us now also plot this difference.

ggplot(rct_data, aes(x = malaria_risk_post, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 1500, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/1500, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)

ggplot(rct_data, aes(x = treat, y = malaria_risk_post, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Post Malaria Risk")

5 Conclusion

We can now conclude that indeed the provision of insecticide-treated nets reduces the risk of malaria by 10.083 units. Thus, if the we want to reduce such a health risk to the wider region, it might be effective to provide such nets to all vulnerable populations.

Let us recap the steps in the analysis of RCTs

  • Step 1: Check balance
  • Step 2: Calculate the difference
  • Step 3: Show the difference